thomasw21 commented Aug 26, 2021

We've noticed that 13B training crashed.

The logs suggest that it's linked to the new carbon feature:

    _set_codecarbon_tracker(args)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/tr1-13B/Megatron-DeepSpeed-tr1-13B/megatron/global_vars.py", line 169, in _set_codecarbon_tracker
    Path(output_dir).mkdir(parents=True, exist_ok=True)
  File "/gpfswork/rech/six/commun/conda/tr1-13B/lib/python3.8/pathlib.py", line 1042, in __new__
    self = cls._from_parts(args, init=False)
  File "/gpfswork/rech/six/commun/conda/tr1-13B/lib/python3.8/pathlib.py", line 683, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/gpfswork/rech/six/commun/conda/tr1-13B/lib/python3.8/pathlib.py", line 667, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

I'm pretty sure it's because we don't check the value of args.codecarbon_dir, which defaults to None in the argument parser.
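The failure mode can be reproduced in isolation. The following is a minimal sketch, not the project's actual parser: the `--codecarbon_dir` flag is declared here only for illustration, assuming it has no explicit default so that argparse fills in None when the flag is omitted.

```python
import argparse
from pathlib import Path

# Hypothetical stand-in for the real argument parser: the flag has no
# explicit default, so argparse sets it to None when it is not passed.
parser = argparse.ArgumentParser()
parser.add_argument("--codecarbon_dir", type=str)
args = parser.parse_args([])  # simulate a launch without the flag

try:
    # Building a Path from None raises TypeError before mkdir() is reached.
    Path(args.codecarbon_dir).mkdir(parents=True, exist_ok=True)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the message depends on the Python version, but the exception type is TypeError in each case.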

Instead, we need to check that the value is not None.
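One way such a check could look is sketched below. This is an illustration under stated assumptions rather than the code that was merged: `prepare_codecarbon_dir` is a hypothetical helper name, and `argparse.Namespace` stands in for the real `args` object.

```python
import argparse
import tempfile
from pathlib import Path
from typing import Optional


def prepare_codecarbon_dir(args: argparse.Namespace) -> Optional[Path]:
    """Hypothetical helper: create the output directory only if one was given."""
    output_dir = getattr(args, "codecarbon_dir", None)
    if output_dir is None:
        # Option not activated: skip the tracker setup instead of crashing.
        return None
    path = Path(output_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Flag omitted: the helper returns None and touches nothing on disk.
skipped = prepare_codecarbon_dir(argparse.Namespace(codecarbon_dir=None))

# Flag provided: the directory is created (inside a temporary directory here).
with tempfile.TemporaryDirectory() as tmp:
    created = prepare_codecarbon_dir(
        argparse.Namespace(codecarbon_dir=f"{tmp}/codecarbon")
    )
    created_exists = created is not None and created.is_dir()
```

Using `getattr` with a None fallback also covers the case where the attribute is missing from `args` entirely, which may or may not matter in the real code.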

I don't think the fix is urgent, since we'll be running with the option activated from now on (though it might impact smaller experiments).

stas00 merged commit e96df7d into main Aug 26, 2021
stas00 pushed a commit that referenced this pull request Aug 26, 2021
thomasw21 deleted the fix_carbon_implem branch August 26, 2021 15:07

stas00 commented Aug 26, 2021

Thanks a lot, @thomasw21

stas00 commented Aug 26, 2021

codecarbon is not ready (see mlco2/codecarbon#238); that's why I was trying to disable it in the slurm script.

stas00 commented Aug 26, 2021

Got a chance to test it, your fix is perfect, @thomasw21! Thank you!
